Combine narrative with numbers for effective storytelling in R

Martin Frigaard

Setup

Link to scenario: https://www.katacoda.com/orm-mfrigaard/scenarios/03-effective-storytelling

The index.json configuration file is set using the following environment:

"environment": {
  "uilayout": "editor-terminal-split"
},
"backend": {
  "imageid": "rlang:3.4"
}

Outline

Below are the 20 step files (plus intro and finish) in the scenario.

Objectives

The objectives for this scenario are:

  1. recognize the needs of your audience (data literacy, level of expertise, etc.)

  2. check and communicate data quality with stakeholders

  3. identify the correct data visualization (single variable, bivariate, and multivariate graphs) based on the data

  4. incorporate feedback from stakeholders/audience into graphs

  5. design visualizations with the appropriate detail and annotations that inform (and do not mislead) the audience

The learners

The learners I’m expecting to be participating in this course are:

  • Jessica will take this class on her own time for professional development.

  • Peter will take this course over a two-day weekend because he needs to complete a project.

  • Bruce is using these scenarios to supplement a semester-long undergraduate course on R.

  • Jane has been told to take this course for her job because her team is using R.

intro

  • included in intro.md?

Welcome!

Welcome to ‘Combine narrative with numbers for effective storytelling in R’! In this scenario, we will cover how to build data visualizations that effectively communicate and engage your audience.

Now that we have some experience with data wrangling with tidyr and dplyr, and data visualization with ggplot2, we can put these tools together to get your point across to stakeholders and audiences.

What makes a bad graphic?

Bad graphs aren’t just ugly; they’re misleading. A chart can have pretty colors and novel font styling, but that can’t make up for inaccurately presenting data. Mislabeling axes, using inappropriate scales or labels, and including unnecessary elements (‘chart junk’) are common characteristics of bad graphics.

We’ll be using the terms ‘graph’, ‘figure,’ and ‘chart’ interchangeably throughout this scenario.

Claus Wilke describes the difference between ugly, bad, and wrong graphs in his excellent text, Fundamentals of Data Visualization.

  • An ugly graph isn’t aesthetically appealing, but the data presented is clear and informative
  • A bad graph fails to communicate the information it contains because of poor design
  • A graph is wrong if it’s mathematically incorrect (the underlying calculations, representations, or transformations are inaccurate).

We want to avoid making ugly, bad, and wrong graphs.

What makes a great graphic?

Have you ever heard a joke and thought, “it’s funny because it’s true”?

We all know a great data visualization when we see it, but can you explain why it’s so great? Beyond just being aesthetically appealing (the choice of color palettes, fonts, etc.), data visualizations are tools for communicating complicated information.

Well, great graphics are similar to great jokes in this way–they should reveal a complicated ‘truth’ that was otherwise difficult to comprehend or articulate with words alone.

This scenario will show you how to make sure your graphs and figures communicate complexity effectively without misleading your audience.

step 1

  • included in step1.md?
  • grammar?
  • spelling?

Things to consider about your audience

You’ll need to determine who the audience or stakeholders will be before creating graphs or figures. You’ll likely create multiple charts throughout a data analysis that you won’t include as a final deliverable. But in most cases, your stakeholders or audience will be whoever receives the final results or product.

The final graphs you produce will depend on 1) the question we’re trying to answer, and 2) the level of statistical literacy of our audience.

If your audience isn’t familiar with particular data visualizations, provide them with enough information to interpret the graph (and check their understanding). However, if you find yourself spending more time explaining a data visualization’s design than what the visualization reveals, you should consider a different graph.

Asking questions

Data visualization should be an iterative process, and getting regular feedback from your audience will help you understand their point of view. It will also help manage their expectations regarding the final deliverable.

As Hilary Mason and D.J. Patil point out in their 2015 text, Data Driven: Creating a Data Culture, asking the right questions “involves domain knowledge and expertise, coupled with a keen ability to see the problem, see the available data, and match up the two.”

The questions below can help guide your project and make sure you understand what your audience/stakeholders are expecting:

  1. What question(s) is this project trying to answer? (or What problem(s) is this project trying to solve?)
  2. Do we have access to the data to answer the question/problem posed? (or do we need to gather more data?)
  3. What is the current format/structure/location of the data? (this will have an enormous impact on the project timeline!)
  4. What context will we present the deliverable(s) in? (slide deck, website, report, etc.)
  5. How familiar will the audience be with the data in our project? (how much background information should we be providing?)

We recommend having the answers to these questions documented somewhere, as this will 1) keep your project focused and timely, and 2) ensure both you and your client/customer have a clear vision for successful completion.

You want to fully understand what you are visualizing, who the audience will be, and why they will care about the results.

step 2

  • included in step2.md?
  • grammar?
  • spelling?

Data lineage: the background on our data

It’s best to start a project off with a ‘view of the forest from outside the trees’. The technical term for this is data lineage, which

“includes the data origin, what happens to it, and where it moves over time.”

Having a ‘bird’s-eye view’ of the data helps confirm there weren’t any problems with exporting or importing. Data lineage also means understanding where the data are coming from–were they collected from an internal relational database, an external vendor, or did they come from the web or social media?

Knowing some of the technical details behind a dataset lets us frame the questions or problems we’re trying to tackle. In this scenario, we will use a variety of data sources, but they will all be [tabular](https://en.wikipedia.org/wiki/Table_(information)). Tabular data, like spreadsheets, organize data into columns and rows. R can handle multiple kinds of data, but that is a topic that extends beyond the scope of this scenario.

Initiate R

Let’s load some data and get started! Launch an R console by clicking here -> R (Click on the Run command icon)

Load packages

The package we’ll use to get an overview of entire datasets in R is skimr. We will install and load it, along with the tidyverse, below:

install.packages(c("tidyverse", "skimr"))
library(tidyverse)
library(skimr)

step 3

  • included in step3.md?
  • grammar?
  • spelling?

Before you start: what do we expect to see?

Before starting a new project, we want to set some expectations. The questions we covered in the previous step help us understand what kind of data we’ll be encountering. Sometimes we’ll be dealing with unknown data, but we should know approximately how many columns and rows the new dataset will contain. We might know some basic information about the variable formats, too.

For example, we should see if we’re getting date columns (YYYY-MM-DD), logical values (TRUE, FALSE, NA), numerical measurements (integer (1L) or double (1)), or categorical data (character (male and female) or factor (low, medium, high)).
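A quick way to check these formats in an R console is with class() and typeof(). The sketch below uses made-up values (none of them come from the datasets in this scenario) to show how each format looks to R:

```r
# illustrative values only; none of these come from the scenario's datasets
date_value    <- as.Date("2020-09-27")
logical_value <- c(TRUE, FALSE, NA)
integer_value <- 1L
double_value  <- 1
char_value    <- c("male", "female")
# a factor stores categories with a fixed set (and order) of levels
factor_value  <- factor(c("low", "high", "medium"),
                        levels = c("low", "medium", "high"))

class(date_value)      # "Date"
class(logical_value)   # "logical"
typeof(integer_value)  # "integer"
typeof(double_value)   # "double"
class(char_value)      # "character"
levels(factor_value)   # "low" "medium" "high"
```

Note that the factor keeps the level order we gave it, not the order in which the values appear.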

Baseball data

We’re going to load a dataset to demonstrate a few ways to investigate a dataset’s quality (or how well it matches our expectations).

These data come from Sean Lahman’s Baseball Database.

Now, I am not going to assume everyone participating in this scenario is familiar with baseball. However, this exercise is arguably more rewarding if you are not a baseball fan. If you’re working with data, part of your job is to be interested in whatever you’ve been asked to analyze (even if it is only for the monetary reward).

“…if you want to work in data visualisation, you need to be relentlessly and systematically curious. You should try to get interested in anything and everything that comes your way.” - Alberto Cairo, Knight Chair in Visual Journalism, University of Miami

Analyzing and visualizing data you’re not familiar with is a chance to learn something new, and it puts you in a position to ask ‘out of the box’ questions.

Doing your homework

It’s also essential to read any accompanying documentation for new datasets. If we read the documentation on the Lahman website, we find out that People contains “Player names, DOB, and biographical info.” The variables in People are presented below:

People table

playerID = A unique code assigned to each player. The playerID links the data in this file with records in the other files.
birthYear = Year player was born
birthMonth = Month player was born
birthDay = Day player was born
birthCountry = Country where player was born
birthState = State where player was born
birthCity = City where player was born
deathYear = Year player died
deathMonth = Month player died
deathDay = Day player died
deathCountry = Country where player died
deathState = State where player died
deathCity = City where player died
nameFirst = Player’s first name
nameLast = Player’s last name
nameGiven = Player’s given name (typically first and middle)
weight = Player’s weight in pounds
height = Player’s height in inches
bats = Player’s batting hand (left, right, or both)
throws = Player’s throwing hand (left or right)
debut = Date that player made first major league appearance
finalGame = Date that player made last major league appearance (blank if still active)
retroID = ID used by retrosheet
bbrefID = ID used by Baseball Reference website

Most of the data pre-processing steps center around a single question: Is this what I expected to see? Reading the documentation gives you expectations about the data to confirm or refute (and then investigate).

Now that we have some background information on this new dataset, we will look at how well People meets our expectations.

Whenever we get a new data source, we should try to view the data in its native format (if possible). We can view the raw data on the Github repository.

Load data

Fortunately, we are also able to load the raw data directly into R using the readr::read_csv() function. We will load the People dataset into R using readr::read_csv(), and assign "https://bit.ly/3scsHw7" to the file argument.

People <- readr::read_csv(file = "https://bit.ly/3scsHw7")

step 4

  • included in step4.md?
  • grammar
  • spelling

Are we seeing what we expected? (1)

Before creating any visualizations, we want a display that gives us an overview of the entire People dataset. In the previous step, we went over some of the People dataset documentation, so we know what to expect.

Skimming data

We’ll be using the skimr package. skimr is designed for:

“displaying summary statistics the user can skim quickly to understand their data”

Below we pass the People dataset to the skimr::skim() function to create PeopleSkim. We then use the base::summary() function to review the new object.

If this code looks unfamiliar to you, review the Introduction to ggplot2 scenario.

PeopleSkim <- People %>%  
  skimr::skim()
summary(PeopleSkim)
Data summary
Name Piped data
Number of rows 20090
Number of columns 24
_______________________
Column type frequency:
character 14
Date 2
numeric 8
________________________
Group variables None

The output above shows a high-level summary of all the variables in the People dataset. We can see there are 20090 rows and 24 columns (14 columns are character, 2 columns are Date, and 8 are numeric).

Viewing character variables

The new PeopleSkim object gives us summary information to check against the documentation and help guide our data visualizations. We will start by viewing the variables according to their types in People using skimr::yank() (read the function documentation on Github). The skim_type argument in skimr::yank() takes a variable type ("character", "numeric", or "Date").

Run the code below to use skimr::yank() to view a skim of the character variables in the People dataset.

PeopleSkim %>% 
  skimr::yank(skim_type = "character")

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
playerID 0 1.00 5 9 0 20090 0
birthCountry 61 1.00 3 14 0 57 0
birthState 535 0.97 2 22 0 295 0
birthCity 174 0.99 3 26 0 4880 0
deathCountry 10235 0.49 3 14 0 25 0
deathState 10285 0.49 2 20 0 107 0
deathCity 10240 0.49 2 26 0 2676 0
nameFirst 37 1.00 2 14 0 2522 0
nameLast 0 1.00 2 14 0 10219 0
nameGiven 37 1.00 2 43 0 13325 0
bats 1180 0.94 1 1 0 3 0
throws 977 0.95 1 1 0 3 0
retroID 56 1.00 8 8 0 20034 0
bbrefID 2 1.00 5 9 0 20088 0

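We can see how much of each variable is missing (n_missing and complete_rate): playerID and nameLast are complete, while roughly half of the deathCountry, deathState, and deathCity values are missing. skimr::skim() also shows us the min, max, empty, n_unique, and whitespace for the 14 character variables.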

Viewing date variables

Next, we use skimr::yank() to view a skim of the Date variables in the People dataset.

PeopleSkim %>% 
  skimr::yank(skim_type = "Date")

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
debut 198 0.99 1871-05-04 2020-09-27 1965-04-14 10572
finalGame 198 0.99 1871-05-05 2020-09-27 1971-05-06 9480

The skim of the Date variables shows us which data are missing (n_missing and complete_rate), along with the earliest (min), latest (max), and middle (median) dates.

The number of unique (n_unique) dates prints to the next line. This behavior is because the terminal window has a width limit. If the Terminal output extends past this limit, the content gets printed to the line below.

Do these numbers make sense?

We can use these values for sanity checks. For example, the n_unique for playerID matches the total number of rows in People, which we should expect from the documentation (playerID = “A unique code assigned to each player”). The earliest dates for both debut and finalGame are in May of 1871 (which corresponds to the first MLB game ever played).
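The uniqueness check described above can also be run directly rather than read off the skim output. The sketch below uses a hypothetical three-row stand-in for People (the IDs and names are invented) and base R functions:

```r
# hypothetical miniature of the People table (IDs and names are invented)
people_small <- data.frame(
  playerID = c("aaa01", "bbb01", "ccc01"),
  nameLast = c("Able", "Baker", "Able"),
  stringsAsFactors = FALSE
)

# TRUE when every row has its own playerID
ids_are_unique <- length(unique(people_small$playerID)) == nrow(people_small)
ids_are_unique

# anyDuplicated() returns 0 when no value is repeated
anyDuplicated(people_small$playerID)

# nameLast is not a unique identifier: anyDuplicated() returns the
# position of the first repeated value instead of 0
anyDuplicated(people_small$nameLast)
```

The same comparison on the full dataset would use People$playerID and nrow(People).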

step 5

  • included in step5.md?
  • grammar
  • spelling

Are we seeing what we expected? (2)

In the previous step, we viewed a skim() of the "character" and "Date" variables in the People dataset. We’re going to continue ‘skimming’ these data and check them against our expectations.

Viewing numeric variables

We’ll use skimr::yank() and skimr::focus() to view the n_missing and complete_rate for the "numeric" variables in People.

PeopleSkim %>% 
  focus(n_missing, complete_rate) %>% 
    yank("numeric")

Variable type: numeric

skim_variable n_missing complete_rate
birthYear 114 0.99
birthMonth 282 0.99
birthDay 424 0.98
deathYear 10232 0.49
deathMonth 10233 0.49
deathDay 10234 0.49
weight 817 0.96
height 737 0.96

The complete_rate for birthYear, birthMonth, birthDay, weight, and height is over 90%. However, the complete_rate for deathYear, deathMonth, and deathDay is under 50%. Why do you think these data have missing values?
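To see where these percentages come from, the sketch below recomputes two of them by hand. It assumes complete_rate is the proportion of non-missing values (an assumption that matches the numbers printed above); the helper function name is made up for this example.

```r
# values copied from the skim output above
n_rows           <- 20090
n_missing_death  <- 10232  # deathYear
n_missing_weight <- 817    # weight

# hypothetical helper: proportion of rows that are not missing
complete_rate <- function(n_missing, n_rows) {
  1 - (n_missing / n_rows)
}

round(complete_rate(n_missing_death, n_rows), 2)   # 0.49
round(complete_rate(n_missing_weight, n_rows), 2)  # 0.96

# the same idea applied directly to a vector with missing values
x <- c(1, NA, 3, NA)
mean(!is.na(x))  # 0.5
```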

The skim() output for the "numeric" variables gives us a set of summary statistics:

Location statistics

  • the mean (or average) gives us the expected value for each variable
  • the median (as p50) or the ‘center’ value for each variable. Half of the values are above, and half are below.
PeopleSkim %>% 
  skimr::focus(numeric.mean, numeric.p50) %>% 
    skimr::yank("numeric") 

Variable type: numeric

skim_variable mean p50
birthYear 1934.42 1942
birthMonth 6.63 7
birthDay 15.62 16
deathYear 1966.29 1968
deathMonth 6.49 6
deathDay 15.53 15
weight 187.80 185
height 72.34 72

Spread statistics

  • the lowest value for each variable, or minimum (as p0)
  • the highest value for each variable, or maximum (as p100)
  • Together, these two values can give us the range, which is the difference between the maximum and minimum values
PeopleSkim %>% 
  skimr::focus(numeric.p0, numeric.p100) %>% 
    skimr::yank("numeric") 

Variable type: numeric

skim_variable p0 p100
birthYear 1820 2000
birthMonth 1 12
birthDay 1 31
deathYear 1872 2020
deathMonth 1 12
deathDay 1 31
weight 65 2125
height 43 83
  • the first quartile (as p25), which is the ‘middle’ of the data points below the median
  • the third quartile (as p75), which is the ‘middle’ of the data points above the median
  • Together, these two values can give us the interquartile range (IQR), which is the difference between the third and first quartiles
PeopleSkim %>% 
  skimr::focus(numeric.p25, numeric.p75) %>% 
    skimr::yank("numeric") 

Variable type: numeric

skim_variable p25 p75
birthYear 1896 1973
birthMonth 4 10
birthDay 8 23
deathYear 1943 1993
deathMonth 3 10
deathDay 8 23
weight 172 200
height 71 74
  • the standard deviation (as sd), a measure of each variable’s dispersion.
  • The standard deviation describes how far a variable’s values are spread out around their mean
PeopleSkim %>% 
  skimr::focus(numeric.mean, numeric.sd) %>% 
    skimr::yank("numeric") 

Variable type: numeric

skim_variable mean sd
birthYear 1934.42 42.73
birthMonth 6.63 3.47
birthDay 15.62 8.76
deathYear 1966.29 32.95
deathMonth 6.49 3.53
deathDay 15.53 8.79
weight 187.80 26.33
height 72.34 2.61
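
The location and spread statistics above can also be computed one at a time with base R. The sketch below uses five made-up weights (not rows from People), including one implausible entry, to show how each statistic reacts to an extreme value:

```r
# made-up weights in pounds, with one implausible entry
weights <- c(150, 172, 185, 200, 2125)

mean(weights)         # pulled upward by the extreme value
median(weights)       # the middle value (p50)
range(weights)        # minimum (p0) and maximum (p100)
diff(range(weights))  # the range: maximum minus minimum
quantile(weights, probs = c(0.25, 0.75))  # first and third quartiles
IQR(weights)          # third quartile minus first quartile
sd(weights)           # standard deviation, also inflated by the outlier
```

The median and IQR barely notice the extreme value, while the mean, range, and standard deviation are dominated by it.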

These numbers can be challenging to make sense of by themselves. Fortunately, the skimr::skim() output comes with a hist column. The hist column is a small histogram for the numeric variables.

Below we use skimr::focus() and skimr::yank() to view the mean, standard deviation (sd), minimum (p0), median (p50), maximum (p100), and hist for the numeric variables in People.

PeopleSkim %>% 
  skimr::focus(numeric.mean, numeric.sd, 
               numeric.p0, numeric.p50, numeric.p100,
               numeric.hist) %>% 
    skimr::yank("numeric") 

Variable type: numeric

skim_variable mean sd p0 p50 p100 hist
birthYear 1934.42 42.73 1820 1942 2000 ▁▅▅▆▇
birthMonth 6.63 3.47 1 7 12 ▇▅▅▆▇
birthDay 15.62 8.76 1 16 31 ▇▇▇▇▆
deathYear 1966.29 32.95 1872 1968 2020 ▁▃▆▇▇
deathMonth 6.49 3.53 1 6 12 ▇▅▅▅▇
deathDay 15.53 8.79 1 15 31 ▇▇▇▆▆
weight 187.80 26.33 65 185 2125 ▇▁▁▁▁
height 72.34 2.61 43 72 83 ▁▁▁▇▁

The hist column shows us a miniature distribution of the values in each numeric variable.

Do these numbers make sense?

  • As we can see, the majority of the missing values are in the variables with the death prefix (deathDay, deathMonth, and deathYear). The missing values in these variables make sense because, given that birthYear ranges from 1820 to 2000 (with a median of 1942), we should expect approximately half of the baseball players in the People dataset to be still alive.

  • We also notice an implausible value from the skimr output: the weight variable maximum value (2125). We can use dplyr’s filter and select functions to find the nameGiven for the abnormally high weight value.

People %>% 
  filter(weight == 2125) %>% 
  select(nameGiven, birthMonth, birthDay, birthYear, weight)
#> # A tibble: 1 x 5
#>   nameGiven    birthMonth birthDay birthYear weight
#>   <chr>             <dbl>    <dbl>     <dbl>  <dbl>
#> 1 Jacob Robert         10       28      1996   2125

Google the player’s name. What is his listed weight on Wikipedia?
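
Implausible values can also be flagged with a simple threshold instead of being spotted by eye. The sketch below uses invented rows and an arbitrary cutoff of 400 pounds; the cutoff is an assumption for illustration, not a rule from the Lahman documentation.

```r
# invented rows standing in for People
players <- data.frame(
  nameGiven = c("Player A", "Player B", "Player C"),
  weight    = c(185, 2125, NA),
  stringsAsFactors = FALSE
)

# arbitrary plausibility cutoff (an assumption for this example)
max_plausible_weight <- 400

# which() drops the NA comparison, so missing weights are not flagged
suspect <- players[which(players$weight > max_plausible_weight), ]
suspect
```

Rows flagged this way still need to be checked against an outside source before they are corrected or removed.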

step 6

  • included in step6.md?
  • grammar
  • spelling

Counting things

“Data science is mostly counting things.” - Sam Firke

Data visualizations are drawings made with numbers. The challenge is picking the best image for the numbers you want to show. Before you can choose what you want to draw, you need to decide which numbers you’d like to display.

Column/bar charts

In a bar (or column) chart, each bar/column length represents a numeric value. The number of levels determines the number of bars or columns.

We will create a bar chart of the bats variable in People, which measures whether the player bats left-handed (L), right-handed (R), both (B), or if these data are missing (NA). Below we’ll use dplyr’s count() function to tally the number of values for the different category items in bats.

People %>% 
  count(bats, sort = TRUE)
#> # A tibble: 4 x 2
#>   bats      n
#>   <chr> <int>
#> 1 R     12435
#> 2 L      5246
#> 3 B      1229
#> 4 <NA>   1180

In ggplot2, we create a bar chart using the geom_bar() function. First we map bats to both x and fill inside the ggplot(aes()) functions. If you need a refresher on ggplot2 layers and mapping, check out the previous scenario.

We also remove the legend with guides(fill = "none"), and add labels for the title, subtitle, caption, and y axis (x is set to NULL).

# click to execute code
gg_step6_bar_01 <- People %>% 
  ggplot(aes(x = bats, fill = bats)) + 
  geom_bar() + 
  # "none" removes the fill legend (fill = FALSE is deprecated in recent ggplot2)
  guides(fill = "none") +
  labs(title = "MLB players' batting hand",
       subtitle = "Left (L), right (R), or both (B)",
       caption = "source: http://www.seanlahman.com/",
       x = NULL, y = "Number of players")
# save
# ggsave(plot = gg_step6_bar_01,
#        filename = "gg-step6-bar-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step6_bar_01

We will need to open the gg-step6-bar-01.png graph in the vscode IDE (above the Terminal console).

Counting top 10 birth countries

geom_bar() only takes a single categorical (or factor) variable and counts the rows for us. Sometimes we’ll need to specify the variable we want to map to the y axis. For example, the code below uses dplyr::filter() and dplyr::count() to count the non-US ‘Country of birth’ values for players in the People dataset. We then use utils::head() to return the top 10 rows, and dplyr::mutate() with forcats::fct_inorder() to change the format of the birthCountry variable to a factor.

People %>% 
  filter(birthCountry != "USA" & !is.na(birthCountry)) %>% 
  count(birthCountry, sort = TRUE) %>% 
  head(10) %>% 
  mutate(birthCountry = fct_inorder(birthCountry))
#> # A tibble: 10 x 2
#>    birthCountry       n
#>    <fct>          <int>
#>  1 D.R.             791
#>  2 Venezuela        425
#>  3 P.R.             270
#>  4 CAN              255
#>  5 Cuba             223
#>  6 Mexico           136
#>  7 Japan             70
#>  8 Panama            65
#>  9 United Kingdom    52
#> 10 Ireland           50

In this case, we have two variables in this output: birthCountry and ‘n’. If we’re going to build a graph from these data, we know we’ll need a way to represent both the country’s name and the number of times it occurs.

For this graph, we will need to use ggplot2’s geom_col() function (we will include the code from above, which creates a dataset with only the top 10 non-US birth countries).

We also map birthCountry to fill, remove the legend with guides(fill = "none"), and add a label for the y axis with ggplot2::labs().

# click to execute code
gg_step6_col_01 <- People %>% 
  filter(birthCountry != "USA" & !is.na(birthCountry)) %>% 
  count(birthCountry, sort = TRUE) %>% 
  head(10) %>% 
  # keep the sorted order from count() as the factor level order
  mutate(birthCountry = fct_inorder(birthCountry)) %>%  
  ggplot(aes(x = birthCountry, y = n, fill = birthCountry)) + 
  geom_col() +
  guides(fill = "none") +
  labs(title = "Top 10 Non-US birth countries for MLB players",
       subtitle = "Based on birthCountry",
       caption = "source: http://www.seanlahman.com/",
       x = NULL, y = "Number of players")
# save
# ggsave(plot = gg_step6_col_01,
#        filename = "gg-step6-col-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step6_col_01

We will need to open the gg-step6-col-01.png graph in the vscode IDE (above the Terminal console).

Flip the coordinates

If we find the x axis gets cluttered and difficult to read, we can pivot the columns’ display with the ggplot2::coord_flip() function. Note that we also need to change the forcats function in mutate() to fct_reorder(), so the countries are ordered by n and the largest count appears at the top of the flipped chart.

# click to execute code
gg_step6_col_02 <- People %>% 
  filter(birthCountry != "USA" & !is.na(birthCountry)) %>% 
  count(birthCountry, sort = TRUE) %>% 
  head(10) %>% 
  # reorder the birthCountry by n
  mutate(birthCountry = fct_reorder(birthCountry, n)) %>%  
  ggplot(aes(x = birthCountry, y = n, 
             fill = birthCountry)) + 
  geom_col() +
  guides(fill = "none") +
  # flip coordinates
  coord_flip() +
  labs(title = "Top 10 Non-US birth countries for MLB players",
       subtitle = "Based on birthCountry",
       caption = "source: http://www.seanlahman.com/",
       x = NULL, y = "Number of players")
# save
# ggsave(plot = gg_step6_col_02,
#        filename = "gg-step6-col-02.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step6_col_02

We will need to open the gg-step6-col-02.png graph in the vscode IDE (above the Terminal console).
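
The difference between the two factor orderings used in this step comes down to the order of the factor levels, which ggplot2 uses to arrange the bars. The sketch below reproduces both orderings with base R analogues, factor(levels = unique()) for order of appearance and reorder() for ordering by a second variable, instead of the forcats functions, using the first three rows of the count output above:

```r
# first three rows of the top-10 count output shown earlier
top <- data.frame(
  birthCountry = c("D.R.", "Venezuela", "P.R."),
  n            = c(791, 425, 270),
  stringsAsFactors = FALSE
)

# levels in order of appearance (what fct_inorder() does)
in_order <- factor(top$birthCountry, levels = unique(top$birthCountry))
levels(in_order)  # "D.R." "Venezuela" "P.R."

# levels sorted by n, smallest first (what fct_reorder(birthCountry, n) does)
by_count <- reorder(top$birthCountry, top$n)
levels(by_count)  # "P.R." "Venezuela" "D.R."
```

After coord_flip(), the first level is drawn at the bottom, so ordering from smallest to largest places the largest count at the top.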

Communication tips

Bar charts and column charts display the counts for each different response in the categorical variable. We can help our audience interpret these graphs by always starting the count axis at zero, sorting the chart’s values to make it easier to read, and flipping the coordinates if the x axis labels are difficult to read.

step 7

  • included in step7.md?
  • grammar?
  • spelling?

Single variable distributions (1)

The skimr output displayed a small histogram for each numeric variable in the People dataset in the previous steps. Histograms are a special kind of bar/column chart–they show the distribution for numeric variables by ‘binning’ each response into a set number of bars/columns.

Load data

These data come from the TidyTuesday project, a data repository whose intent is

“to provide a safe and supportive forum for individuals to practice their wrangling and data visualization skills independent of drawing conclusions.”

We’re going to use a dataset of Ramen ratings from The Ramen Rater. Read more about these data here.

Below we import the raw data from an external .csv file into Ramen and get a skimr::skim() summary (stored in RamenSkim).

Ramen <- readr::read_csv("https://bit.ly/38sO0S7")
RamenSkim <- skimr::skim(Ramen)

Review data

View the character variables in RamenSkim

RamenSkim %>% 
  skimr::yank("character")

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
brand 0 1 1 34 0 456 0
variety 0 1 3 96 0 2971 0
style 2 1 3 10 0 8 0
country 0 1 2 13 0 44 0

How complete are these data?

View the mean, standard deviation (sd), minimum (p0), median (p50), maximum (p100), and hist for the numeric variables in Ramen.

RamenSkim %>% 
  skimr::focus(numeric.mean, numeric.sd, 
               numeric.p0, numeric.p50, numeric.p100,
               numeric.hist) %>% 
  skimr::yank("numeric")

Variable type: numeric

skim_variable mean sd p0 p50 p100 hist
review_number 1590.06 917.94 1 1590.00 3180 ▇▇▇▇▇
stars 3.69 1.03 0 3.75 5 ▁▁▂▇▅

Pay attention to the hist column for stars–it shows the distribution for the values. Where are most of the values concentrated?

We will investigate the distribution of stars by building a histogram with ggplot2.

Build a histogram

We’re going to use ggplot2::geom_histogram() to view the distribution of the stars variable in Ramen. Note that we are also assigning labels to the graph that include 1) a clear title, 2) descriptive information about the graph, and 3) the source of the data.

# click to execute code
gg_step7_hist_01 <- Ramen %>% 
  ggplot(aes(x = stars)) + 
  geom_histogram() + 
  labs(
       title = "Distribution of ramen stars", 
       subtitle = "bins = 30",
       caption = "source: https://www.theramenrater.com/resources-2/the-list/")
# save
# ggsave(plot = gg_step7_hist_01,
#        filename = "gg-step7-hist-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step7_hist_01

We will need to open the gg-step7-hist-01.png graph in the vscode IDE (above the Terminal console).

As we stated above, histograms stack the variable values into a defined set of bins. The default number for bins is 30. We can change the shape of the histogram by changing the bins argument.

Run the code below to see how the distribution looks with 20 bins. Note we also include the color = "white" argument to ensure we can see each bar separately.

# click to execute code
gg_step7_hist_02 <- Ramen %>% 
  ggplot(aes(x = stars)) + 
  geom_histogram(bins = 20, color = "white") + 
  labs(
       title = "Distribution of ramen stars", 
       subtitle = "bins = 20",
       caption = "source: https://www.theramenrater.com/resources-2/the-list/")
# save
# ggsave(plot = gg_step7_hist_02,
#        filename = "gg-step7-hist-02.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step7_hist_02

Open the gg-step7-hist-02.png graph in the vscode IDE (above the Terminal console).

The stars values fit into 20 bins better than the default 30 because we can see where values are concentrated (and the high frequency of 5-star ratings).
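
To make the idea of binning concrete, the sketch below counts values per bin by hand with base R’s cut() and table(). The ratings are made up (they are not the Ramen data), and the bin edges are chosen for illustration; geom_histogram() picks its own bin boundaries, so its counts will not necessarily match a hand-rolled version like this one.

```r
# made-up star ratings (not the Ramen data)
stars <- c(0, 2.5, 3.5, 3.75, 3.75, 4, 5, 5)

# five bins of width 1; include.lowest keeps the 0 rating in the first bin
wide_bins <- table(cut(stars, breaks = seq(0, 5, by = 1),
                       include.lowest = TRUE))

# ten bins of width 0.5 over the same range
narrow_bins <- table(cut(stars, breaks = seq(0, 5, by = 0.5),
                         include.lowest = TRUE))

wide_bins
narrow_bins
```

Both tables describe the same eight ratings; only the bin width changes, which is why the shape of a histogram depends on the bins argument.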

step 8

  • included in step8.md?
  • grammar?
  • spelling?

Single variable distributions (2)

The previous step demonstrated how to use a histogram to view the distribution of a single variable. We needed to adjust the bins in the histogram to make its shape easier to interpret. Density plots use kernel smoothing to create cleaner distributions.

Build a density plot

We’re going to use ggplot2::geom_density() to view a density plot of the stars variable in Ramen. We will use fill to color the area underneath the density line with "dodgerblue".

# click to execute code
gg_step8_density_01 <- Ramen %>% 
  ggplot(aes(x = stars)) + 
  geom_density(fill = "dodgerblue") + 
  labs(title = "Distribution of ramen stars", 
  caption = "source: https://www.theramenrater.com/resources-2/the-list/")
# save
# ggsave(plot = gg_step8_density_01,
#        filename = "gg-step8-density-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step8_density_01

Open the gg-step8-density-01.png graph in the vscode IDE (above the Terminal console).
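
Kernel smoothing can be explored outside of ggplot2 with base R’s density() function. The sketch below uses made-up ratings (not the Ramen data) and the function’s default settings; it is meant to show what a density estimate contains, not to reproduce the plot above exactly.

```r
# made-up star ratings (not the Ramen data)
stars <- c(0, 2.5, 3.5, 3.75, 3.75, 4, 5, 5)

# kernel density estimate with the default kernel and bandwidth
dens <- density(stars)

# x holds evenly spaced grid points, y holds the estimated density at each
head(dens$x, 3)
head(dens$y, 3)

# the y values are densities, not counts: the area under the curve,
# approximated as height times grid spacing, is close to 1
grid_step <- diff(dens$x[1:2])
area <- sum(dens$y) * grid_step
area
```

Because the curve is scaled so that its area is about 1, the y axis does not show how many reviews received each rating.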

Adding useful labels

Although density plots create a much smoother distribution, the y axis is harder to interpret. To overcome this, we will add two summary statistics programmatically to the labels using the base::paste0() and base::round() functions.

Run the code below to see how this works:

# click to execute code
subtitle_dens_stars <- paste0("Star rating (mean +/- sd): ", 
       # use round() to make sure there are only two decimals
       round(mean(Ramen$stars, na.rm = TRUE), 2),
       " +/- ",
       round(sd(Ramen$stars, na.rm = TRUE), 2))
subtitle_dens_stars
#> [1] "Star rating (mean +/- sd): 3.69 +/- 1.03"

We can now supply subtitle_dens_stars to the labs(subtitle = ) function.

Creating labels this way ensures they are updated whenever the underlying data change.

# click to execute code
gg_step8_density_02 <- Ramen %>% 
  ggplot(aes(x = stars)) + 
  geom_density(fill = "dodgerblue") + 
  labs(title = "Distribution of ramen stars", 
       # combine text with mean() and sd() for stars in Ramen
       subtitle = subtitle_dens_stars,
       # include source
       caption = "source: https://www.theramenrater.com/resources-2/the-list/")
# save
# ggsave(plot = gg_step8_density_02,
#        filename = "gg-step8-density-02.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step8_density_02

Open the gg-step8-density-02.png graph in the vscode IDE (above the Terminal console).

As we’ve said, an essential part of effective communication is knowing your audience. It’s unlikely these exploratory graphs will be part of our final deliverable, so the audience for these graphs will likely be us!

Using descriptive labels makes sure we know what we’re seeing when we’re viewing our graphs.
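The label-building pattern from this step can be wrapped in a small helper so the same format is reused across graphs. The sketch below is only an illustration: label_mean_sd() is a hypothetical name (it is not part of any package), the toy ratings are made up, and only base R is used.

```r
# hypothetical helper (not from any package): builds a "mean +/- sd" label
label_mean_sd <- function(x, prefix, digits = 2) {
  paste0(prefix, " (mean +/- sd): ",
         round(mean(x, na.rm = TRUE), digits),
         " +/- ",
         round(sd(x, na.rm = TRUE), digits))
}
# toy ratings with one missing value (made-up numbers)
toy_stars <- c(1, 2, 3, NA)
toy_label <- label_mean_sd(toy_stars, prefix = "Star rating")
toy_label
#> [1] "Star rating (mean +/- sd): 2 +/- 1"
```

A call such as label_mean_sd(Ramen$stars, "Star rating") could then be passed to labs(subtitle = ).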

step 9

  • included in step9.md?
  • grammar?
  • spelling?

Multiple variable distributions (1)

We’ve looked at the distribution of all the values in the stars variable, but what if we were interested in the distribution of stars across the groups in a categorical variable like style, which is the style of container (cup, pack, tray, etc.)?

We can check the levels of style with dplyr::count():

Ramen %>% 
  count(style, sort = TRUE)
#> # A tibble: 9 x 2
#>   style          n
#>   <chr>      <int>
#> 1 Pack        1832
#> 2 Bowl         612
#> 3 Cup          559
#> 4 Tray         138
#> 5 Box           32
#> 6 Restaurant     3
#> 7 <NA>           2
#> 8 Bar            1
#> 9 Can            1

The output above tells us the five most commonly reviewed ramen styles are Pack, Bowl, Cup, Tray, and Box.
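The most common levels can also be pulled out of the count() output instead of being typed by hand. The sketch below uses a made-up toy data frame (toy_ramen) in place of Ramen, and top_styles is a hypothetical name used only for illustration.

```r
library(dplyr)
# toy data standing in for the style column in Ramen (values are made up)
toy_ramen <- data.frame(
  style = c(rep("Pack", 6), rep("Bowl", 5), rep("Cup", 4),
            rep("Tray", 3), rep("Box", 2), "Can", NA),
  stringsAsFactors = FALSE)
top_styles <- toy_ramen %>%
  # drop missing styles before counting
  filter(!is.na(style)) %>%
  # count each style, most common first
  count(style, sort = TRUE) %>%
  # keep the first five rows
  head(5) %>%
  # return the style column as a character vector
  pull(style)
top_styles
#> [1] "Pack" "Bowl" "Cup"  "Tray" "Box"
```

A vector like this could be supplied to filter(style %in% top_styles), so the filter stays correct if the underlying counts change.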

Grouped skims

We can use dplyr's filter(), select(), and group_by() functions with skimr to see the distribution of the stars variable across the five most common style levels.

# click to execute code
Ramen %>% 
  # filter to most common styles
  filter(style %in% c("Pack", "Bowl",
                      "Cup", "Tray", "Box")) %>% 
  # select only stars and style
  select(stars, style) %>% 
  # group dataset by style
  group_by(style) %>% 
  # skim grouped data
  skim() %>% 
  # focus on select output
  skimr::focus(n_missing, style,
               numeric.mean, numeric.sd, numeric.hist,
               numeric.p0, numeric.p50, numeric.p100) %>% 
  # only return numeric values
  skimr::yank("numeric") 

Variable type: numeric

skim_variable style n_missing mean sd p0 p50 p100 hist
stars Bowl 0 3.71 1.05 0 3.75 5 ▁▁▂▇▆
stars Box 0 4.21 1.29 0 5.00 5 ▁▁▁▂▇
stars Cup 0 3.49 1.04 0 3.50 5 ▁▁▂▇▃
stars Pack 14 3.74 0.99 0 3.75 5 ▁▁▂▇▆
stars Tray 0 3.60 1.14 0 3.75 5 ▁▁▂▇▆

The output shows ramen from a Box has the highest mean stars rating (4.21). We are going to confirm this with a ridgeline plot.

The ggridges package

The mean and median (p50) in the skimr output tell us the distribution of stars varies slightly for the filtered levels of style, so we will view the density for each distribution with a ridgeline plot from the ggridges package.

Install and load ggridges below:

# click to execute code
install.packages("ggridges")
library(ggridges)

Build labels first!

We’ll build the labels for this graph first in labs_ridge_stars_style, so we know what we’re expecting to see.

# click to execute code
labs_ridge_stars_style <- labs(
       title = "Star ratings by style",  
       subtitle = "Star rating across most common ramen containers",
       caption = "source: https://www.theramenrater.com/resources-2/the-list/",
       x = "Star rating", 
       y = NULL) 

I’ve found this practice to be very helpful for conceptualizing graphs before I begin building them, which reduces errors and saves time!

Overlapping density plots

The code below uses the ggridges::geom_density_ridges() function to build overlapping density plots. In this plot, we map the fill argument to the style variable. We also want to set the guides(fill = ) to FALSE because we’ll have labels on the graph for each level of style.

# click to execute code
gg_step9_ridge_01 <- Ramen %>%
  # filter to most common styles
  filter(style %in% c("Pack", "Bowl",
                      "Cup", "Tray", "Box")) %>%
  ggplot(aes(x = stars,
             y = style,
             fill = style)) +
  geom_density_ridges() +
  guides(fill = FALSE) +
  # add labels
  labs_ridge_stars_style
# save
# ggsave(plot = gg_step9_ridge_01,
#        filename = "gg-step9-ridge-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step9_ridge_01

Open the gg-step9-ridge-01.png graph in the vscode IDE (above the Terminal console).

From the ridgeline plot, we can see that the stars ratings for the Box level of style are concentrated around 5.

step 10

  • included in step10.md?
  • grammar?
  • spelling?

Multiple variable distributions (2)

In the last step, we learned the distribution for Ramen stars ratings varied across the five most common levels of style. This step will view the variation of stars across style with a box-plot. Box-plots are a great way of viewing the summary statistics for a numeric variable (like stars) across multiple levels of a categorical variable (like style).

Box-plot labels

We’ll build the labels for the graph similar to the labels we used for the ridgeline plot, but we’ll be a little more explicit with the subtitle and x axis.

# click to execute code
labs_box_stars_style <- labs(
     title = "Star ratings by style",  
     subtitle = "Star ratings across pack, bowl, cup, tray, and box containers",
     caption = "source: https://www.theramenrater.com/resources-2/the-list/",
     x = "Ramen star ratings", 
     y = NULL) 

Building box-plots

We’ll filter the data to the five most common styles again and map stars to the x axis and style to the y axis. We will also map style to the fill aesthetic inside ggplot2::geom_boxplot().

We don’t need a guide (or legend), so we will remove it with guides(fill = FALSE).

gg_step10_boxplot_01 <- Ramen %>% 
  # filter to most common styles
  filter(style %in% c("Pack", "Bowl",
                      "Cup", "Tray", "Box")) %>%
  ggplot(aes(x = stars, y = style)) + 
  geom_boxplot(aes(fill = style)) +
  guides(fill = FALSE) + 
  labs_box_stars_style
# save
# ggsave(plot = gg_step10_boxplot_01,
#        filename = "gg-step10-boxplot-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step10_boxplot_01

Open the gg-step10-boxplot-01.png graph in the vscode IDE (above the Terminal console).

In the next step, we will cover how to interpret the contents of a box-plot.

step 11

  • included in step11.md?
  • grammar?
  • spelling?

Multiple variable distributions (3)

Box plots display many of the numbers we see in the skimr output.

Contents of a box-plot

Click on the code below to create a skimr summary of Ramen stars ratings for the Tray style.

# click to execute code
Ramen %>% 
  # filter to the Tray style
  filter(style == "Tray") %>% 
  # select only stars and style
  select(stars, style) %>% 
  # group dataset by style
  group_by(style) %>% 
  # skim grouped data
  skimr::skim() %>% 
  # focus on select output
  skimr::focus(style,
               numeric.p0, numeric.p25, numeric.p50,
               numeric.p75, numeric.p100, numeric.hist) %>% 
  # only return numeric values
  skimr::yank("numeric") 

Variable type: numeric

skim_variable style p0 p25 p50 p75 p100 hist
stars Tray 0 3 3.75 4.25 5 ▁▁▂▇▆

We can calculate the interquartile range using dplyr below:

# click to execute code
Ramen %>% 
  # filter to the Tray style
  filter(style == "Tray") %>% 
  # select only stars and style
  select(stars, style) %>% 
  # group dataset by style
  group_by(style) %>% 
  # summarize IQR
  summarize(
    `Stars/Tray IQR` = IQR(stars, na.rm = TRUE))
#> # A tibble: 1 x 2
#>   style `Stars/Tray IQR`
#>   <chr>            <dbl>
#> 1 Tray              1.25

The figure below shows a box-plot for the distribution of stars ratings across the Tray level of style. We’ve labeled the summary statistics from the skimr output on the graph. The 25th percentile (or p25) is the box’s first vertical line (or hinge). The 50th percentile (or p50) is the median and middle line of the box, and the 75th percentile (p75) is the last vertical line in the box (or hinge).

We’ve also labeled the minimum (p0) and maximum (p100) values and the interquartile range (IQR). Like the standard deviation, the IQR measures spread: it is the distance between p25 and p75 (here, 4.25 - 3 = 1.25).
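The arithmetic behind the box can be checked by hand. The sketch below copies the Tray quartiles from the skimr output and applies the 1.5 * IQR rule that ggplot2::geom_boxplot() uses by default for its whiskers; the variable names are only for illustration.

```r
# quartiles for Tray, copied from the skimr output above
p25 <- 3
p75 <- 4.25
# the interquartile range is the width of the box
iqr <- p75 - p25
iqr
#> [1] 1.25
# by default, whiskers reach the most extreme values within 1.5 * IQR
# of each hinge; values beyond these fences are drawn as single points
lower_fence <- p25 - 1.5 * iqr
upper_fence <- p75 + 1.5 * iqr
c(lower_fence, upper_fence)
#> [1] 1.125 6.125
```

Under this rule, a Tray rating of 0 sits below the lower fence, so we would expect it to be drawn as an individual point rather than as the end of a whisker.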

Communication tip

If your audience is not familiar with box-plots, the figure above is an example of supporting information to include. It explains how to read the graph, using an example from the finished chart. However, complex plots add mental labor for your audience and can take attention away from the point you’re trying to make. Effective communication means always using the most straightforward (or most common) graphs to reveal your findings.

step 12

  • included in step12.md?
  • grammar?
  • spelling?

Communicating column/bar charts

Now that we’ve created a few graphs, we should stop and consider what narrative information we’ve contained in each plot. We used column and bar charts for displaying the counts for two categorical variables:

  • bats, which measures whether MLB players are left-handed (L), right-handed (R), both (B), or if these data are missing (NA)
  • birthCountry, which tells us the ‘Country of Birth’ for each MLB player

We can use these graphs to convey comparison information. For example, we can see from the bar graph that most MLB players are right-handed batters.

Bar graph

The column chart displays similar information. The columns can be used for comparing the frequency of one country to the other. We’ve also arranged the columns in a way that makes it clear which country appears the most and which country appears the least.

Column graph

The text to accompany your graphs will largely depend on the context of the problem you’re trying to solve (or question you’re trying to answer), but there are a few general guidelines we can apply to each type of graph.

Communication (labels)

Titles should be objective and neutral, expressing the “who,” “what,” and “where” of the figure’s measurements. Avoid jargon and unnecessary descriptive words. Stick with 1) what data was measured, 2) when the data was measured, and 3) how the data was counted (i.e., the units).

When you are building labels, plan on providing enough information that the chart becomes a ‘stand-alone product.’ By this, we mean that if a new observer viewed your graph, they would at minimum be able to understand what point the figure was trying to make (i.e., “this graph shows the values in X variable,” or “this figure shows the relationship between X and Y”).

Communication (distributions)

We’ve explored the distribution of the stars rating variable in the Ramen dataset by itself in the histogram and density plot and across the top five types in the ridgeline plot and box-plot.

Histogram

Density graph

Ridgeline plot

Box-plot

Distribution plots are useful if we’re answering exploratory questions about a variable before calculating statistics or building a model. Histograms, density, and ridgeline plots can quickly tell us if a variable has a normal (bell-shaped) distribution, which is a crucial assumption to check before modeling. Box-plots require a higher degree of statistical literacy for interpretation, so we recommend confirming your audience is familiar with these graphs before relying on them. Summary statistics are also vital to include with distribution graphs (usually in a supplementary table) because they tether the figure to mathematical values.

The information from these exploratory charts gives your narrative context and frames the problem. If we were telling a story, this would be the portion that tells us the setting or universe in which our characters live.

step 13

  • included in step13.md?
  • grammar?
  • spelling?

Visualizing relationships (1)

Now that we know how to explore variable distributions, we will look at relationships between two (or more) variables. We must establish what kind of relationship we’re investigating before deciding what plot to make. We’ve already been exploring the relationship between two variables. We looked at the distribution (or spread) of stars vs. five categories of style in the Ramen dataset, and we also looked at birthCountry vs. n (the count of each birth country) in the People dataset.

In this step, we will look at how two numeric variables change (or vary) and if that change is in the same (or opposite) direction. One of the most common graphs for visualizing this relationship between two numerical variables is the scatterplot. A scatterplot uses points to display two numeric variables, with one variable on each axis.

Star Wars data

We will load the starwars data from the dplyr package. These data come from the Star Wars API. Read more about this dataset on dplyr's website.

# click to execute code
StarWars <- dplyr::starwars 

We will look at the relationship between height and mass for characters in the StarWars dataset. Let’s start by looking at the n_missing and complete_rate for these two variables.

# click to execute code
StarWars %>% 
  select(height, mass) %>% 
  skimr::skim() %>% 
  skimr::focus(n_missing, complete_rate) %>% 
  skimr::yank("numeric")

Variable type: numeric

skim_variable n_missing complete_rate
height 6 0.93
mass 28 0.68

We can see that almost one-third of the values in mass are missing (28 missing, complete_rate = 0.68). This amount of missing data might surprise us if we didn’t explore it before plotting.
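The two columns in the skimr output are simple to compute by hand, which helps when explaining them to stakeholders. The sketch below uses a made-up toy vector (toy_mass) and base R only.

```r
# toy vector with missing values (made-up numbers)
toy_mass <- c(77, NA, 32, 136, NA, 120, 75, NA)
# n_missing is the count of NA values
n_missing <- sum(is.na(toy_mass))
# complete_rate is the proportion of values that are not NA
complete_rate <- mean(!is.na(toy_mass))
n_missing
#> [1] 3
complete_rate
#> [1] 0.625
```

Read the same way, a complete_rate of 0.68 for mass means roughly 32% of its values are missing.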

We will also view the mean, standard deviation (sd), minimum (p0), 25th percentile (p25), median (p50), 75th percentile (p75), maximum (p100), and hist for these two columns.

# click to execute code
StarWars %>% 
  select(height, mass) %>% 
  skimr::skim() %>% 
  skimr::focus(numeric.mean, numeric.sd, 
               numeric.p0, numeric.p25, 
               numeric.p50, numeric.p75, 
               numeric.p100, numeric.hist) %>% 
  skimr::yank("numeric") 

Variable type: numeric

skim_variable mean sd p0 p25 p50 p75 p100 hist
height 174.36 34.77 66 167.0 180 191.0 264 ▁▁▇▅▁
mass 97.31 169.46 15 55.6 79 84.5 1358 ▇▁▁▁▁

We can see at least one value of mass that is considerably higher than the rest. We can tell because the location statistics are reasonably close to each other (mean = 97.3, median (p50) = 79), but the spread is almost twice the value of the location (sd = 169). The maximum value (p100) of 1358 also confirms this finding. What is going on with this value?
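To see why a single extreme value produces this pattern, the sketch below compares summary statistics on a made-up toy vector with and without one large value. The numbers and object names are illustrative only.

```r
# toy masses with one extreme value (made-up numbers)
toy_mass <- c(50, 60, 70, 80, 90, 1358)
stats_all <- c(mean = mean(toy_mass),
               median = median(toy_mass),
               sd = sd(toy_mass))
# drop the extreme value and recompute
toy_trimmed <- toy_mass[toy_mass < 200]
stats_trim <- c(mean = mean(toy_trimmed),
                median = median(toy_trimmed),
                sd = sd(toy_trimmed))
# compare the two sets of statistics side by side
rbind(stats_all, stats_trim)
```

The median barely moves (75 vs. 70), while the mean and standard deviation change dramatically, which is the same signal we see in the mass column.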

Investigate outliers

It’s always a good idea to investigate values that seem implausible (like we did with the abnormally high weight for the MLB player in step 5). If we can’t figure out what is going on, we should communicate this with our stakeholders. Outliers can have a big impact on data visualizations (and statistical models), so ensuring we account for them is essential for communicating with our audience.

We’re going to filter the StarWars data to only observations with mass greater than 200 and select only the name, height, mass, sex, and species columns (we chose 200 because it’s a little more than twice the p75 value of 84.5).

StarWars %>% 
    filter(mass > 200) %>% 
    select(name, height, mass, sex, species)
#> # A tibble: 1 x 5
#>   name                  height  mass sex            species
#>   <chr>                  <int> <dbl> <chr>          <chr>  
#> 1 Jabba Desilijic Tiure    175  1358 hermaphroditic Hutt

We can now see this mass belongs to Jabba the Hutt, which makes sense if we do some additional research.

Labels

Now that we know what we’re going to visualize, we can make our labels.

# click to execute code
labs_scatter_ht_mass_01 <- labs(
  title = "Star Wars characters' height and mass", 
  x = "Mass", 
  y = "Height")

Scatterplots

We will create a scatterplot with ggplot2::geom_point() by mapping mass to the x axis and height to the y axis.

# click to execute code
gg_step13_scatter_01 <- StarWars %>% 
  ggplot(aes(x = mass, y = height)) + 
  geom_point() + 
  # add labels
  labs_scatter_ht_mass_01
# save
# ggsave(plot = gg_step13_scatter_01,
#        filename = "gg-step13-scatter-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step13_scatter_01

Open the gg-step13-scatter-01.png graph in the vscode IDE (above the Terminal console). Notice how the points are all clustered on the left-hand side of the chart? The x axis has extended to account for Jabba the Hutt’s high mass value, which has made it hard to interpret the relationship among the other values.

What happens when we remove the outliers?

If you encounter outliers, it’s a good idea to view each graph with and without them to see how much they influence the plot. Let’s filter Jabba the Hutt’s mass out of the StarWars dataset and rebuild the scatterplot. We’ll also add this information to a new set of labels, so we don’t get the two graphs confused.

# click to execute code

# new labels
labs_scatter_ht_mass_02 <- labs(
  title = "Star Wars characters' height and mass", 
  subtitle = "Characters with mass less than 200",
  x = "Mass", 
  y = "Height")

# build graph
gg_step13_scatter_02 <- StarWars %>% 
  filter(mass < 200) %>% 
  ggplot(aes(x = mass, y = height)) + 
  geom_point() + 
  # add labels
  labs_scatter_ht_mass_02
# save
# ggsave(plot = gg_step13_scatter_02,
#        filename = "gg-step13-scatter-02.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step13_scatter_02

Open the gg-step13-scatter-02.png graph in the vscode IDE (above the Terminal console).

Based on the scatterplot, we can see a positive relationship between mass and height for Star Wars characters. But is this the same for all types of characters? For example, does this relationship hold for all levels of gender?
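One way to put a number on what the two scatterplots show is the Pearson correlation, computed with and without the outlier. This is a sketch we are adding for illustration (correlation is not otherwise covered in this scenario); r_all, r_no_outlier, and NoJabba are hypothetical names.

```r
library(dplyr)
StarWars <- dplyr::starwars
# Pearson correlation using only rows where both values are present
r_all <- cor(StarWars$height, StarWars$mass, use = "complete.obs")
# remove the outlier (this also drops rows with missing mass)
NoJabba <- StarWars %>% 
  filter(mass < 200)
r_no_outlier <- cor(NoJabba$height, NoJabba$mass, use = "complete.obs")
c(with_outlier = r_all, without_outlier = r_no_outlier)
```

We would expect the correlation to be noticeably stronger once the outlier is removed, mirroring the difference between the two graphs.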

step 14

  • included in step14.md?
  • grammar?
  • spelling?

Visualizing relationships (2)

We will continue looking at the relationship between mass and height in the StarWars dataset. We looked at mass and height with and without an outlier’s influence in the previous step. In this step, we will add a categorical variable (gender) to the plot to see if the direction of the change for mass and height is the same for all levels of gender.

Counting with tabyls

Let’s view the count of gender below using the tabyl() function from the janitor package.

# click to execute code
install.packages("janitor")
library(janitor)

janitor::tabyl() works similarly to dplyr::count(), but automatically prints a bit more information in the output. Click on the code block below to create a tabyl for the gender variable.

# click to execute code
StarWars %>% 
  janitor::tabyl(gender) 
#>     gender  n    percent valid_percent
#>   feminine 17 0.19540230     0.2048193
#>  masculine 66 0.75862069     0.7951807
#>       <NA>  4 0.04597701            NA

We can see the standard output produces percent and valid_percent columns. We can also add percent formatting with janitor::adorn_pct_formatting():

# click to execute code
StarWars %>% 
  janitor::tabyl(gender) %>% 
  janitor::adorn_pct_formatting()
#>     gender  n percent valid_percent
#>   feminine 17   19.5%         20.5%
#>  masculine 66   75.9%         79.5%
#>       <NA>  4    4.6%             -

This output tells us 4 characters in the StarWars dataset will not show up if we use the gender variable. Read more about the tabyl function options here.
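The difference between percent and valid_percent comes down to the denominator. The sketch below reproduces both columns in base R from the counts in the tabyl output; the object names are only for illustration.

```r
# counts copied from the tabyl output above
gender_n <- c(feminine = 17, masculine = 66, missing = 4)
# percent uses every row (including missing) as the denominator
percent <- gender_n / sum(gender_n)
# valid_percent drops the missing rows from the denominator
valid_n <- gender_n[c("feminine", "masculine")]
valid_percent <- valid_n / sum(valid_n)
round(percent, 3)
round(valid_percent, 3)
```

When reporting percentages to an audience, it helps to state which denominator was used, because 19.5% and 20.5% describe the same 17 characters.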

Scatterplot (3 variables)

One way to include the gender variable in the scatterplot is to map it to the color aesthetic. The output from tabyl tells us there are 4 values in gender that will be missing from this plot.

We will update our labels and add gender to the scatterplot in the code below.

# click to execute code
labs_scatter_ht_mass_gender <- labs(
  title = "Star Wars characters' gender, height and mass", 
  subtitle = "Data for gender (feminine/masculine), height, and mass < 200",
  x = "Mass", 
  color = "Gender",
  y = "Height")

gg_step14_scatter_01 <- StarWars %>% 
  filter(!is.na(gender) & mass < 200) %>% 
  ggplot(aes(x = mass, y = height, color = gender)) + 
  geom_point() +
  # add labels
  labs_scatter_ht_mass_gender
# save
# ggsave(plot = gg_step14_scatter_01,
#        filename = "gg-step14-scatter-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step14_scatter_01

Open the gg-step14-scatter-01.png graph in the vscode IDE (above the Terminal console).

The color of the points shows that the feminine characters occupy a smaller range of values for the relationship between mass and height.

step 15

  • included in step15.md?
  • grammar?
  • spelling?

Visualizing relationships (3)

Sometimes we will want to look at how a particular measurement changes over time (its trend). When we’re visualizing trends, the x axis is typically the measure of time, and the y axis contains our measurement of interest. Each time point along the x axis has a corresponding value on the y axis, and lines connect these points. These lines are extended along the x axis’s full scale to display the change over time (or the trend). See the example from the FiveThirtyEight article titled, “Comic Books Are Still Made By Men, For Men And About Men”:

We’re going to re-create this chart using data from the fivethirtyeightdata package.

Comic Book Data

The fivethirtyeightdata package contains data from the FiveThirtyEight GitHub repository, but these data have been formatted to follow “tame data principles for introductory statistics and data science courses.”

We are going to load the comic_characters dataset from the article above. We’re only interested in a subset of this dataset, so we select the relevant variables and do some initial formatting steps before assigning the result to ComicData (read more about the data here).

# click to execute code
ComicData <- read_csv("https://bit.ly/3oS1zQY") 
# subset data
ComicData <- ComicData %>% 
  # select only publisher, name, sex, year, and date
  select(publisher, name, sex, year, date) %>% 
  # filter to only the rows containing either male or female characters
  filter(sex %in% c("Female Characters", "Male Characters")) %>% 
  # convert these two variables to factors
  mutate(sex = factor(sex, 
                      levels = c("Female Characters", 
                                 "Male Characters")),
         publisher = factor(publisher, 
                            levels = c("Marvel", "DC"))) %>% 
  # remove all missing values
  drop_na()
# view
glimpse(ComicData)
#> Rows: 21,408
#> Columns: 5
#> $ publisher <fct> DC, DC, DC, DC, DC, DC, DC, DC, DC, DC, DC, DC, DC, DC, DC,…
#> $ name      <chr> "Batman (Bruce Wayne)", "Superman (Clark Kent)", "Green Lan…
#> $ sex       <fct> Male Characters, Male Characters, Male Characters, Male Cha…
#> $ year      <dbl> 1939, 1986, 1959, 1987, 1940, 1941, 1941, 1989, 1969, 1956,…
#> $ date      <date> 1939-05-01, 1986-10-01, 1959-10-01, 1987-02-01, 1940-04-01…

We formatted two variables in ComicData as factors. We will use skimr::skim() to get an overview of publisher and sex.

Factors

Factor variables are unique kinds of qualitative or categorical variables in R because they have a “fixed and known set of possible values.” We assigned these values with the levels argument.
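A short base R sketch shows what “fixed and known” means in practice. The character vector below is made up for illustration, including a value ("Image") that is not one of the declared levels.

```r
# toy publisher values (made up); "Image" is not a declared level
publisher_chr <- c("DC", "Marvel", "DC", "Image")
publisher_fct <- factor(publisher_chr, levels = c("Marvel", "DC"))
# levels are stored in the order we supplied, not alphabetically
levels(publisher_fct)
#> [1] "Marvel" "DC"
# values outside the declared levels become NA
is.na(publisher_fct)
#> [1] FALSE FALSE FALSE  TRUE
```

The level order matters later: when an unnamed vector of colors is supplied to a manual color scale, the colors are matched to the levels in this order.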

The skimr output below shows us the two new factor variables we’ve created in ComicData.

# click to execute code
ComicData %>% 
  skimr::skim() %>% 
  skimr::yank("factor") 

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
publisher 0 1 FALSE 2 Mar: 14728, DC: 6680
sex 0 1 FALSE 2 Mal: 15834, Fem: 5574

The top_counts column tells us the counts for both levels in publisher and sex.

Summarizing data

To recreate the graph above, we’ll need to summarize the ComicData data. We need year represented on the x axis, and the percentage of female comic book characters for each publisher represented on the y axis. We can do this with dplyr's group_by(), summarize(), mutate() and ungroup() functions.

The code below creates two new variables:

  • sex_n_per_yr_pub, which is the count of each level of sex per year and publisher, and
  • sex_pct_per_yr_pub, which is the percentage of each level of sex per year and publisher

We also filter the year variable to only data between 1980 and 2010.

# click to execute code
ComicNewFemalePerc <- ComicData %>% 
  group_by(year, publisher, sex) %>% 
  summarize(sex_n_per_yr_pub = n()) %>% 
  group_by(year, publisher) %>% 
  mutate(sex_pct_per_yr_pub = sex_n_per_yr_pub / sum(sex_n_per_yr_pub),
         sex_pct_per_yr_pub = round(sex_pct_per_yr_pub, digits = 3)) %>% 
  ungroup() %>% 
  filter(year > 1979 & year < 2011)
head(ComicNewFemalePerc)
#> # A tibble: 6 x 5
#>    year publisher sex               sex_n_per_yr_pub sex_pct_per_yr_pub
#>   <dbl> <fct>     <fct>                        <int>              <dbl>
#> 1  1980 Marvel    Female Characters               60              0.243
#> 2  1980 Marvel    Male Characters                187              0.757
#> 3  1980 DC        Female Characters               10              0.278
#> 4  1980 DC        Male Characters                 26              0.722
#> 5  1981 Marvel    Female Characters               58              0.271
#> 6  1981 Marvel    Male Characters                156              0.729
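
The percentage calculation can be verified on a few rows. The sketch below copies the 1980 counts from the output above into a toy data frame (toy_counts, a hypothetical name) and repeats the group_by() and mutate() step.

```r
library(dplyr)
# counts copied from the first four rows of the output above
toy_counts <- data.frame(
  year = 1980,
  publisher = c("Marvel", "Marvel", "DC", "DC"),
  sex = c("Female Characters", "Male Characters",
          "Female Characters", "Male Characters"),
  n = c(60, 187, 10, 26),
  stringsAsFactors = FALSE)
toy_pct <- toy_counts %>% 
  # the denominator is the total per year and publisher
  group_by(year, publisher) %>% 
  mutate(pct = round(n / sum(n), digits = 3)) %>% 
  ungroup()
toy_pct$pct
#> [1] 0.243 0.757 0.278 0.722
```

Within each year and publisher, the two percentages add up to 1, which is a quick check that the grouping is correct.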

Labels

We will build labels identical to those in the FiveThirtyEight graph but include the original article’s URL as a caption.

# click to execute code
labs_line_comics <- labs(
  title = "Comics Aren't Gaining Many Female Characters", 
  subtitle = "Percentage of new characters who are female", 
  caption = "https://fivethirtyeight.com/features/women-in-comic-books/",
  x = NULL, 
  y = NULL)

Line graph

Now that we have our data and labels, we can build the line graph. First, we need to filter the data to only female percentages, then pass the filtered data to ggplot(aes()), mapping the year to the x axis and sex_pct_per_yr_pub to the y axis.

On the next layer, inside the ggplot2::geom_line() function, we map publisher to the group and color aesthetics inside the aes() function. Outside the aes() function, we make the lines larger by setting the size to 2.

We need to make a few more customizations to this graph to make it look like the one in the article:

  • Note the FiveThirtyEight graph does not have a legend or guide. We can remove the legend by adding a ggplot2::theme(legend.position = "none") layer.

  • The y axis in the FiveThirtyEight graph ranges from 0 to 50 and is formatted as a percent (%). Displaying percentage units helps the audience understand that a proportion is displayed (not the raw counts). We can change the formatting on the y axis with the ggplot2::scale_y_continuous() function. Set the limits argument to c(0.00, 0.50), and the labels argument to scales::percent_format(accuracy = 1).

# click to execute code
gg_step15_line_01 <- ComicNewFemalePerc %>% 
  filter(sex == "Female Characters") %>% 
  ggplot(aes(x = year, y = sex_pct_per_yr_pub)) + 
  geom_line(aes(group = publisher, color = publisher),
            size = 2) + 
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(0.00, 0.50),
                     labels = scales::percent_format(accuracy = 1)) +
  # add labels
  labs_line_comics
# save
# ggsave(plot = gg_step15_line_01,
#        filename = "gg-step15-line-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step15_line_01

Open the gg-step15-line-01.png graph in the vscode IDE (above the Terminal console).
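
To see what the labels argument is doing on the y axis, the formatter can be called on its own. In this sketch, to_percent is a hypothetical name for the function returned by scales::percent_format().

```r
library(scales)
# percent_format() returns a function that formats proportions as percents
to_percent <- percent_format(accuracy = 1)
# accuracy = 1 rounds to the nearest whole percent
to_percent(c(0, 0.243, 0.5))
#> [1] "0%"  "24%" "50%"
```

The underlying data stay as proportions (0 to 0.5); only the axis labels are converted.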

Our graph is starting to look like the figure in the article, but we still need to make a few changes. We removed the legend, so we have no way of knowing which line belongs to which publisher (DC or Marvel). In the next section, we will learn how to add these labels onto the graph near their respective lines.

step 16

  • included in step16.md?
  • grammar?
  • spelling?

Adding text annotations

In the previous step, we started recreating the following graph from the FiveThirtyEight article titled, “Comic Books Are Still Made By Men, For Men And About Men”.

We left off with the line graph in the gg-step15-line-01.png file. The FiveThirtyEight graph designers chose to remove the legend and add text labels directly to the chart, helping communicate the trend to audiences. This practice isn’t always the best choice, but it works in this graph for a few reasons, which we will cover below.

When to add text to graphs

Text on graphs is essential to communicating the graph’s contents. After all, a figure without text is just a drawing. We’ve already covered what information to include in the graph labels (i.e., the title, subtitle, caption, and x and y axes). Annotations are additional text we place in the plot area to help the audience understand what we’re trying to show.

We recommend adding text annotations as long as they aren’t obscuring or distracting the audience from the point you’re trying to make.

In the FiveThirtyEight graph, the text annotations work because there are only two levels for the measure of interest (DC and Marvel). Encoding this information in color aesthetics is helpful because we can set them to contrasting hues (i.e., blue and red). If there were more publisher levels, the color aesthetic might not be the best choice.

Lastly, the text annotations highlight the graph title’s information (“Comics Aren’t Gaining Many Female Characters”). Both annotations are in the same general area on the x axis, and their placement on the y axis is far enough from the line that the text doesn’t overlap.

Building text annotations

To get our graph closer to the FiveThirtyEight figure, we’re going to set the line colors with ggplot2::scale_color_manual(). This function takes a values argument, which we will fill with two colors, c("firebrick3", "dodgerblue").

In ggplot2, we can add text annotations with the annotate() function. Each level of publisher gets a layer and requires the following arguments:

  • geom = "text" this lets annotate() know we want text
  • x = this is the position on the x axis we want the annotation
  • y = this is the position on the y axis we want the annotation
  • label = this is the text we want to use for the annotation
  • size = 7 the size of the text
  • color = the same values we supplied to scale_color_manual()

# click to execute code
gg_step16_line_01 <- ComicNewFemalePerc %>% 
  filter(sex == "Female Characters") %>% 
  ggplot(aes(x = year, y = sex_pct_per_yr_pub)) + 
  geom_line(aes(group = publisher, color = publisher),
            size = 2) + 
  scale_y_continuous(limits = c(0.00, 0.50),
                     labels = scales::percent_format(accuracy = 1)) + 
  # set the line colors
  scale_color_manual(values = c("firebrick3", "dodgerblue")) +
  # add text annotations
  annotate(geom = "text", x = 2001, y = .47, 
           label = "DC", size = 7, color = "dodgerblue") +
  annotate(geom = "text", x = 2002, y = .25, 
           label = "Marvel", size = 7, color = "firebrick3") + 
  # add labels
  labs_line_comics
# save
# ggsave(plot = gg_step16_line_01,
#        filename = "gg-step16-line-01.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step16_line_01

Open the gg-step16-line-01.png graph in the vscode IDE (above the Terminal console).

We’re almost there! We will make a few final adjustments to this graph to change the fonts displayed on the title and x and y axes.

Adjusting font attributes

The title font in the FiveThirtyEight graph is bold, and the subtitle text size is approximately the same size as the text on the x and y axes. We can adjust the font attributes in a ggplot2 graph with the theme() function. We’ve already used the theme() function in the previous step to remove the legend (legend.position = "none").

The ggplot2::theme() function comes with many options for customizing the way our graph looks. We will only cover a small number of these, and you should read more about using ggplot2 in the excellent R Graphics Cookbook, 2nd Edition by Winston Chang.

To change the font text attributes, we will set the following arguments in theme() (note that each argument also takes the ggplot2::element_text() function):

  • Change the plot.title text with element_text(size = 18, face = "bold")
  • Change the plot.subtitle text with element_text(size = 15)
  • Change the axis.text.y and axis.text.x text with element_text(size = 14)

# click to execute code
gg_step16_line_02 <- ComicNewFemalePerc %>% 
  filter(sex == "Female Characters") %>% 
  ggplot(aes(x = year, y = sex_pct_per_yr_pub)) + 
  geom_line(aes(group = publisher, color = publisher),
            size = 2) + 
  scale_y_continuous(limits = c(0.00, 0.50),
                     labels = scales::percent_format(accuracy = 1)) + 
  # adjust text on labels and axes
  theme(legend.position = "none", 
        plot.title = element_text(size = 18, face = "bold"),
        plot.subtitle = element_text(size = 15),
        axis.text.y = element_text(size = 14),
        axis.text.x = element_text(size = 14)) +
  # set the line colors for each publisher
  scale_color_manual(values = c("firebrick3", "dodgerblue")) +
  # add text annotations
  annotate(geom = "text", x = 2001, y = .47, 
           label = "DC", size = 7, color = "dodgerblue") +
  annotate(geom = "text", x = 2002, y = .25, 
           label = "Marvel", size = 7, color = "firebrick3") + 
  # add labels
  labs_line_comics
# save
# ggsave(plot = gg_step16_line_02,
#        filename = "gg-step16-line-02.png",
#        device = "png",
#        width = 9,
#        height = 6,
#        units = "in")
gg_step16_line_02

Open the gg-step16-line-02.png graph in the VS Code IDE (above the Terminal console).

Now our plot looks very close to the original graph in the FiveThirtyEight article. If you like the look of this graph, be sure to check out the ggthemes package, which provides theme_fivethirtyeight() and other ready-made themes.
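
As a minimal sketch of what using ggthemes might look like, the code below applies theme_fivethirtyeight() to a small made-up data frame. The toy_comics data, its values, and the gg_toy_538 name are hypothetical and only stand in for ComicNewFemalePerc; the sketch also assumes the ggthemes package is installed.

```r
library(ggplot2)
library(ggthemes)

# hypothetical data: invented values standing in for ComicNewFemalePerc
toy_comics <- data.frame(
  year = rep(c(1980, 1990, 2000, 2010), times = 2),
  publisher = rep(c("DC", "Marvel"), each = 4),
  pct_female = c(0.25, 0.30, 0.35, 0.40, 0.20, 0.25, 0.30, 0.35)
)

gg_toy_538 <- ggplot(toy_comics, aes(x = year, y = pct_female)) +
  geom_line(aes(group = publisher, color = publisher)) +
  scale_y_continuous(limits = c(0.00, 0.50),
                     labels = scales::percent_format(accuracy = 1)) +
  # a named vector ties each color to a publisher explicitly
  scale_color_manual(values = c(DC = "dodgerblue", Marvel = "firebrick3")) +
  labs(title = "Toy example with theme_fivethirtyeight()") +
  # apply the complete theme first
  theme_fivethirtyeight() +
  # then add tweaks, so the complete theme does not overwrite them
  theme(legend.position = "none")
gg_toy_538
```

Adding theme(legend.position = "none") after theme_fivethirtyeight() matters: a complete theme replaces any theme settings that come before it in the chain.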

step 17

  • included in step16.md?
  • grammar?
  • spelling?

Communication (relationships)

A graph describing a relationship answers a certain kind of question, e.g., “What is the relationship between two quantitative measures?”

When presenting a graph with relationships, consider the context and framing for the conclusions your audience will draw. Is this good news? For example, if the chart displays a drop in sales over time, anticipate how this will change your presentation’s tone, and be ready to answer questions.

It’s also important not to confuse your audience when designing graphs. Relationships with ‘good news’ should have the data points showing a positive trend (i.e., as X values increase, so do Y values), and vice versa. You don’t want to find yourself explaining that your graph doesn’t show what your audience is seeing.
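
A minimal sketch of a relationship graph follows. It uses the built-in mtcars data rather than the scenario’s data, and the gg_relationship and wt_mpg_cor names are made up for this illustration. The question it answers is “What is the relationship between a car’s weight and its fuel efficiency?”

```r
library(ggplot2)

# mtcars ships with R: wt is weight (1,000 lbs), mpg is miles per gallon
gg_relationship <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 2) +
  # straight-line fit to make the direction of the relationship visible
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # the title states the conclusion we expect the audience to draw
  labs(title = "Heavier cars travel fewer miles per gallon",
       x = "Weight (1,000 lbs)",
       y = "Miles per gallon")
gg_relationship

# the sign of the correlation gives the direction of the trend
wt_mpg_cor <- cor(mtcars$wt, mtcars$mpg)
wt_mpg_cor
```

The trend here is negative, so the title says so directly instead of leaving the audience to work out the direction of the relationship on their own.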

step 18

  • included in step17.md?
  • grammar?
  • spelling?

step 19

  • included in step18.md?
  • grammar?
  • spelling?

step 20

  • included in step19.md?
  • grammar?
  • spelling?

Other resources for missing data

Read more about visualizing missing data here and on the visdat package site, or on the inspectdf package website.
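
As a rough starting point, the sketch below counts missing values per column with base R and then, only if the visdat package happens to be installed, draws its missingness overview. It uses the built-in airquality data rather than the scenario’s data, and the na_counts name is made up for this illustration.

```r
# airquality ships with R and has missing values in some columns
na_counts <- colSums(is.na(airquality))
na_counts

# optional: overview plot of missing values, if visdat is available
if (requireNamespace("visdat", quietly = TRUE)) {
  print(visdat::vis_miss(airquality))
}
```

Sharing a table like na_counts with stakeholders before presenting any graphs is one simple way to communicate data quality.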

finish

  • included in finish.md?
  • grammar?
  • spelling?